01. Connect Python(Jupyter)

  • Python에서 Bigquery를 사용하는 방법은 1) gcloud의 bigquery를 사용하는 방법과 2) pandas 내장 gbq를 사용 하는 방법 2가지가 있습니다

gcloud의 bigquery 사용

  • gcloud sdk install을 따라서 설치해주세요! 그 후 아래와 같이 진행해주세요
    (터미널에서)
    pip3 install gcloud
    gcloud auth login
  • 그 후, GOOGLE_APPLICATION_CREDENTIALS을 설정해줍니다
  • gcloud auth로 하는 방법, API 사용자 인증 정보 - service_key.json을 사용하는 방법 등이 있습니다
  • json key file을 설정하는 방법은 링크를 클릭해주세요!

In [1]:
from gcloud import bigquery as bq
import uuid
import os

gcloud auth로 진행한 경우


In [2]:
client = bq.Client()

service_key.json을 사용하는 경우


In [4]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Path/to/service_key.json"

In [3]:
client = bq.Client()

In [ ]:
# client의 project명 print
print(client.project)

In [4]:
# dataset 설정
dataset = client.dataset("bigquery-public-data:samples")

async_query


In [5]:
query_job = client.run_async_query(str(uuid.uuid4()), """
        #standardSQL
        SELECT corpus AS title, COUNT(*) AS unique_words
        FROM `publicdata.samples.shakespeare`
        GROUP BY title
        ORDER BY unique_words DESC
        LIMIT 10""")

In [8]:
# query 실행
query_job.begin()

In [9]:
destination_table = query_job.destination
destination_table.reload()
for row in destination_table.fetch_data():
    print(row)


[('hamlet', 5318), ('kinghenryv', 5104), ('cymbeline', 4875), ('troilusandcressida', 4795), ('kinglear', 4784), ('kingrichardiii', 4713), ('2kinghenryvi', 4683), ('coriolanus', 4653), ('2kinghenryiv', 4605), ('antonyandcleopatra', 4582)]
10
None

sync_query


In [10]:
query_string = '''
        #standardSQL
        SELECT corpus AS title, COUNT(*) AS unique_words
        FROM `publicdata.samples.shakespeare`
        GROUP BY title
        ORDER BY unique_words DESC
        LIMIT 10
        '''

In [11]:
query = client.run_sync_query(query_string)
query.timeout_ms = 5 * 60 * 1000

In [12]:
query.run()

In [13]:
data = query.fetch_data()[0]

In [14]:
data


Out[14]:
[('hamlet', 5318),
 ('kinghenryv', 5104),
 ('cymbeline', 4875),
 ('troilusandcressida', 4795),
 ('kinglear', 4784),
 ('kingrichardiii', 4713),
 ('2kinghenryvi', 4683),
 ('coriolanus', 4653),
 ('2kinghenryiv', 4605),
 ('antonyandcleopatra', 4582)]

2) pandas 내장 gbq 사용


In [6]:
from pandas.io import gbq

In [8]:
query = '''
        #standardSQL
        SELECT corpus AS title, COUNT(*) AS unique_words
        FROM `publicdata.samples.shakespeare`
        GROUP BY title
        ORDER BY unique_words DESC
        LIMIT 10
        '''

In [13]:
gbq.read_gbq(query = query, project_id = 'project_id', dialect='standard')


Requesting query... ok.
Job ID: job_-HhwJshSnAMGgO8VOow5U0vRCS4G
Query running...
Query done.
Processed: 2.4 MB
Standard price: $0.00 USD

Retrieving results...
Got 10 rows.

Total time taken 2.06 s.
Finished at 2017-10-03 21:15:04.
Out[13]:
title unique_words
0 hamlet 5318
1 kinghenryv 5104
2 cymbeline 4875
3 troilusandcressida 4795
4 kinglear 4784
5 kingrichardiii 4713
6 2kinghenryvi 4683
7 coriolanus 4653
8 2kinghenryiv 4605
9 antonyandcleopatra 4582